Markovian domain fingerprinting: statistical segmentation of protein sequences
Authors
Abstract
MOTIVATION Characterization of a protein family by its distinct sequence domains is crucial for functional annotation and correct classification of newly discovered proteins. Conventional Multiple Sequence Alignment (MSA) based methods run into difficulties when faced with heterogeneous groups of proteins. Moreover, even many protein families that do share a common domain contain instances of several other domains, without any common underlying linear ordering. Ignoring this modularity may lead to poor or even false classification results. An automated method that can decompose a group of proteins into the sequence domains it contains is therefore highly desirable.
RESULTS We apply a novel method to the problem of protein domain detection. The method takes as input an unaligned group of protein sequences. It segments them and clusters the segments into groups sharing the same underlying statistics. For each group of segments, a Variable Memory Markov (VMM) model is built using a Prediction Suffix Tree (PST) data structure. Refinement is achieved by letting the PSTs compete over the segments, and a deterministic annealing framework infers the number of underlying PST models while avoiding many inferior solutions. By matching a unique signature to each domain, we show that regions of similar statistics correlate well with protein sequence domains. This is done in a fully automated manner and neither requires nor attempts an MSA. Several representative cases are analyzed. We identify a protein fusion event, refine an HMM superfamily classification into the underlying families the HMM cannot separate, and detect all 12 instances of a short domain in a group of 396 sequences.
CONTACT [email protected]; [email protected].
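The idea of letting variable-memory models compete over segments can be illustrated with a small sketch. This is not the authors' PST learning procedure and it omits the deterministic annealing step entirely: it is a simplified stand-in that predicts each residue from the longest sufficiently frequent preceding context, then labels fixed-length windows by whichever model scores them highest. The names `SuffixContextModel` and `assign_segments`, and the parameters `max_depth`, `min_count`, `pseudocount`, and `window`, are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids (assumed alphabet)


class SuffixContextModel:
    """Toy variable-memory Markov model (hypothetical, not the paper's PST).

    Each symbol is predicted from the longest suffix of its history that was
    seen at least `min_count` times during training.
    """

    def __init__(self, max_depth=3, min_count=2, pseudocount=0.5):
        self.max_depth = max_depth
        self.min_count = min_count
        self.pseudocount = pseudocount
        # counts[context][symbol] = times `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))

    def fit(self, sequences):
        for seq in sequences:
            for i, symbol in enumerate(seq):
                # Record the symbol under every context length 0..max_depth.
                for d in range(min(self.max_depth, i) + 1):
                    self.counts[seq[i - d:i]][symbol] += 1
        return self

    def _context(self, history):
        # Try the longest suffix first, then back off to shorter ones.
        for d in range(min(self.max_depth, len(history)), -1, -1):
            ctx = history[len(history) - d:]
            if ctx in self.counts and sum(self.counts[ctx].values()) >= self.min_count:
                return ctx
        return ""  # fall back to the empty (order-0) context

    def prob(self, symbol, history):
        ctx_counts = self.counts[self._context(history)]
        total = sum(ctx_counts.values())
        # Additive smoothing keeps unseen symbols at non-zero probability.
        return (ctx_counts.get(symbol, 0) + self.pseudocount) / (
            total + self.pseudocount * len(ALPHABET)
        )

    def log_likelihood(self, segment):
        return sum(
            math.log(self.prob(s, segment[:i])) for i, s in enumerate(segment)
        )


def assign_segments(sequence, models, window=10):
    """Label each non-overlapping window with the index of the best model."""
    labels = []
    for start in range(0, len(sequence), window):
        segment = sequence[start:start + window]
        scores = [m.log_likelihood(segment) for m in models]
        labels.append(max(range(len(models)), key=scores.__getitem__))
    return labels
```

Training one model on a repetitive `AC` sequence and another on a repetitive `GGH` sequence, then calling `assign_segments` on their concatenation, should label the two halves with different model indices. A fuller treatment would re-estimate each model from the segments it wins and iterate, which is the refinement step the abstract describes.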
Similar resources
A region-level graph labeling approach to motion-based segmentation
This paper deals with the problem of motion-based segmentation of image sequences. Such partitions serve multiple purposes in dynamic scene analysis. We first extract a texture-based partition using an unsupervised MRF approach. The regions obtained are then grouped according to a motion-based criterion. This grouping process relies on two motion estimation techniques and exploits contextual informa...
In Silico Characterization of Proteins Containing ARID-PHD Domain and Its Expression in Aeluropus littoralis Halophyte
Abiotic stresses are the most important factors that reduce the yield of crops. Bioinformatics analysis plays an important role in studying genes and their relatedness, as well as in predicting their function in response to abiotic stresses. Among all domains, the ARID-PHD domain has been identified in plants and animals and has a very significant role in growth regulation, cell cycle, and ...
DNA Fingerprinting Based on Repetitive Sequences of Iranian Indigenous Lactobacilli Species by (GTG)5- REP-PCR
Background and Objective: The use of lactobacilli as probiotics requires the application of accurate and reliable methods for the detection and identification of bacteria at the strain level. Repetitive sequence-based polymerase chain reaction (rep-PCR), a DNA fingerprinting technique, has been successfully used as a powerful molecular typing method to determine taxonomic and phylogenetic relat...
Agnostic Clustering of Markovian Sequences
Classification of finite sequences without explicit knowledge of their statistical origins is of great importance to various applications, such as text categorization and biological sequence analysis. We propose a new information theoretic algorithm for this problem based on two important ingredients. The first, motivated by the "two sample problem", provides an (almost) optimal model independent cr...
Molecular analysis of AbOmpA type-1 as immunogenic target for therapeutic interventions against MDR Acinetobacter baumannii infection
Introduction: Acinetobacter baumannii is associated with hospital-acquired infections. Outer membrane protein A of A. baumannii (AbOmpA) is a well-characterized virulence factor that plays important roles in the pathogenesis of this bacterium. Methods: Based on our PCR-sequencing of the ompA gene in the clinical isolates, AbOmpA protein can be categorized into two types, named here type-1 and type-2. We ...
Journal title:
- Bioinformatics
Volume 17, Issue 10
Pages -
Publication date: 2001